Fast microarray expression data analysis method for network exploration

ABSTRACT

A method for performing network reconstruction is provided. The method includes the steps of selecting a predictor set of features, adding a complement to the predictor set based on a quality of prediction, checking to see if all of the features of the predictor set are repeated, and then removing one feature from the predictor set. The algorithm and method repeat the steps of adding a complement, checking the predictor set and removing a feature until the features of the predictor set are repeated. If the features of the predictor set are repeated, the algorithm and method terminate.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a divisional application of co-pending and commonly assigned U.S. patent application Ser. No. 09/779,240 entitled “FAST MICROARRAY EXPRESSION DATA ANALYSIS METHOD FOR NETWORK EXPLORATION”, the disclosure of which is hereby incorporated herein by reference.

BACKGROUND OF THE INVENTION

The micro-array was developed in 1995 to measure gene data in a massively parallel fashion. Since that time a significant increase in the amount of data per experiment has occurred (see http://www-binf.bio.uu.nl/~dutilh/gene-networks/thesis.html). In the case of gene exploration, this extensive data is important for use in assessing genes and their influence on expressed states in organisms. In particular, it is necessary to assess the function and operation of a particular gene; a gene being defined as a prescribed area on a nucleic acid molecule that is necessary for the production of a final expressed phenotype of an individual. On a more complex and broader scale, the interaction network is also of interest due to its influence in regulating higher cellular, physiological and behavioral properties. Recent attempts are being made to reconstruct the precise interaction network, or fragments of it, based on large-scale array experiments for a condition-specific database, e.g., melanoma (Bittner et al., 2000). The critical first step in these efforts is to find the smallest subset of features (predictors) related to an arbitrary target, within a desirable degree of precision. Based on such a set of predictors, computed for every target of interest, it is possible to find the smallest set that can explain or predict the behavior of any target in terms of expression. In the case of genes, finding the smallest set to predict a prescribed behavior can be a very complicated and arduous task, given the massive amount of data that results from analyzing a complete organism's genome.

Most important to scientists is the ability to select a minimal cardinality set that can represent a whole set of expressed information. In the pattern recognition literature, this is known as feature selection or dimensionality reduction, depending on the context.

The issue at hand is now a question of mathematics and computation rather than pure biology. A number of algorithms from other fields and areas of study can be applied to help solve this arduous task. The specific problem in focus, from a computational standpoint, is to find the best (with respect to a given quality function) k-tuples, from a set of n features, for many values of k. One method to find the best k-tuple predictor subset is to conduct an exhaustive search through all possible k-tuples. Although this approach always leads to the best solution, it becomes intractable for even moderate values of k (the computational time grows exponentially with k).
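To illustrate the scale of the problem with hypothetical numbers: an exhaustive search for the best triple (k=3) among n=1,000 features must already examine C(1000,3) = 166,167,000 candidate subsets, and the number of k-tuples grows on the order of n^k.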

Methods developed for pattern recognition will also be important to bio-informatics. In the context of pattern recognition, machine learning, data mining and their applications to various subject areas, e.g., medical diagnostics, manufacturing test design, image recognition, etc., a similar problem of subset selection, known as feature selection, is important. A number of approaches have been proposed and designed to address these problems. The approaches include, but are not limited to, sequential (backward and forward) search techniques, floating search techniques, and genetic algorithms. However, methods based on the sequential search techniques suffer from the nesting effect, i.e., they overlook good feature sets consisting of individually poor quality features. The floating search methods (Pudil et al., 2000; Somol et al., 2000) attempt to avoid the nesting problem by successively adding the best and removing the worst subsets of features from the candidate set of features. This introduces an exponential complexity in the search as the size of a subset grows. A significant drawback of these methods is that they become slow for large dimensional data sets, as is the case with biological expression data. Genetic algorithms also do not have well defined stopping criteria and, in principle, can be exponentially complex.

Most importantly, the above methods and algorithms are intended to be applied in the field of array data processing to enable computationally efficient searches for the smallest subset of features that predict a target's expression levels within a given level of confidence or accuracy.

It would, therefore, be desirable to develop a method and algorithm that determines “good” solution sets with high probability in linear time with respect to the total number of features in a predictor set. Toward this end, “sequential forward selection” (SFS) (Bishop, 1997; Pudil et al., 2000; Somol & Pudil, 2000) was developed to add the best new feature (the one that leads to the largest improvement in the value of the quality function), at each successive stage of the algorithm, to the current set of features until the needed number of features is reached. It follows from its construction that SFS suffers from the nesting problem and always overlooks better solution sets whose features are of mediocre or poor quality. This is one of the shortcomings addressed by the present invention. While “sequential floating forward selection” (SFFS) also addresses the nesting problem, it maintains exponential time complexity for large data sets. The second shortcoming that this invention addresses is the exponential time complexity of finding “good” solutions. The proposed method finds a “good” solution set with high probability in linear time with respect to the number of predictors. One of the floating search algorithms, called “oscillating search” (Somol & Pudil, 2000), can also find approximate solutions in linear time. However, the present invention and method guarantees an equal or better quality solution while maintaining the linear time complexity. In addition, the same generic method or algorithm can be used not only for gene network reconstruction, but can also be applied to protein data, feature selection for classification, and other biological data that is very large and complex to organize and analyze.

SUMMARY OF THE INVENTION

The invention provides a method for determining a predictor set of features associated with a target. The method comprises the steps of selecting a predictor set of features, adding a complement to the predictor set based on a quality of prediction, checking to see if all of the features of the predictor set are repeated, and then removing one feature from the predictor set. The algorithm and method repeat the steps of adding, checking and removing features until the features of the predictor set are repeated. If the features of the predictor set are repeated, the algorithm and method terminate.

More specifically, the invention is a method for probabilistically determining a subset of features of size k that are closely related to a given target in terms of a selected quality function. The method of the invention operates by allowing a user to select a target of choice and the size (k) of the predictor set. Once a target has been selected, the method starts by selecting an arbitrary (ordered) subset of features of size k−1 and iteratively adds and removes single features (in order) from the selected subset. This process is iterated until a subset of features of size k is found whose quality of prediction of the target can no longer be improved by the process of deletion followed by addition of a feature. The algorithm terminates at this stage. In each iteration, the comparisons are based on a quality function that determines a quality of prediction associated between the predictors and the target. The method of the invention can easily be extended to probabilistically determine the smallest (in size) subset of features that is closely related to a given target, within an a priori set tolerance level, in terms of a selected quality function. The method then takes as input a given target (user selected) and iteratively applies the method of the invention for subsets of size 1, 2, 3, . . . , k, etc., until a predictor set that is closely related to the target expression, within the a priori set threshold, is found. The method can also be used for classification of experiments. The method in this case defines a target in terms of a vector of numbers representing the class of experiments under consideration. The method can then be used to identify a subset of features which can classify the data.

The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims. The novel features which are believed to be characteristic of the invention, both as to its organization and method of operation, together with further objects and advantages, will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawing, in which:

FIG. 1 illustrates a schematic view of the present invention in vector format showing the target and the predictors.

FIG. 2 shows a block diagram of the method of the invention.

FIG. 3 shows how the GSSA algorithm makes comparisons of data.

FIG. 4 shows a simulated plot of the number of attractors v. the number of experiments.

FIG. 5 shows a simulated plot of the probability of finding an optimal attractor v. number of genes.

FIG. 6 shows a simulated plot of the execution time v. number of genes.

FIG. 7 shows a simulated plot of the execution time v. number of predictors.

FIG. 8 shows a simulated plot of the log of the execution time v. number of genes.

DETAILED DESCRIPTION

Before describing the present invention in detail, it is to be understood that this invention is not limited to specific compositions, process steps, or equipment, as such may vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. The invention has broad-based use and application in a variety of fields, including most importantly the fields of chemistry, biochemistry, computer science and biology.

It must be noted that, as used in this specification and the appended claims, the singular forms “a”, “an” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “an attractor” includes more than one attractor, reference to “a predictor” includes a plurality of predictors, and the like.

In describing and claiming the present invention, the following terminology will be used in accordance with the definitions set out below.

The term “feature” shall mean expression levels or biological data of a defined gene, protein, or other biological function or component under consideration and over a prescribed number of experiments. It is also the smallest component or item in a predictor set that cannot be further reduced.

The term “network reconstruction” shall mean the process, apparatus, steps and algorithms used in determining associated and/or non-associated pathways in a data set. In particular, it shall mean the relationships or pathways between two or more predictor sets.

The term “target” shall have a broad-based meaning to include proteins, genes, immunological information, feature selection for classification, and other complex biological and chemical data and/or components that may be defined over a number of experiments. The term has particular meaning for chemical and biochemical data that is extensive, complex and difficult to analyze. For instance, in the case of genes it shall mean the expression levels, over the number of experiments, of a selected gene of interest.

The term “predictor set” shall have a broad-based meaning to include proteins, genes, immunological information, feature selection for classification, and other complex biological and chemical data and/or components, for a given size k, that are used to compute or predict an associated quality or characteristic of a target. For instance, in the case of gene vectors of a given size k, it is used to compute or predict an expression level of a target gene vector.

The term “predictor(s)” is used for a feature that is part of a predictor set.

The term “prediction” is a vector computed by using a linear/non-linear function of the features in the predictor set (although we describe the proposed invention in terms of a linear prediction function, the method is not limited to the same; it can be extended to other prediction functions, such as non-linear ones).

The term “quality” or “quality of prediction” shall herein mean the distance between the prediction and the target. The smaller the distance between the prediction and the target, the better the quality of prediction. Geometrically, for k=2, quality is the distance between the target and the plane formed by the two features in the predictor set. It should be noted that this definition of the quality function should not be interpreted as limiting, and any other computable function may be used.

An “attractor” shall mean a set of a given size k of features such that the quality cannot be improved by replacing one feature in the set. In other words, in the case of gene data, a set of genes S is an attractor if the quality of its prediction of the target gene, G, cannot be improved by substituting only one gene in this set (in some sense an attractor is a local minimum). It should be noted that the best solution is always an attractor.

The term “complement” shall mean a feature (when looking for a predictor set of size k, it is defined as follows: a feature g is called a complement to a given set of k−1 features if no other feature, along with this given set of k−1 features, can form a higher quality set of k predictors). In the case k=2, gene g* is called a complement to feature g if the “quality” Q(.,.) of the couple (g,g*) is no worse than that of any couple (g,h); Q(g,g*)≦Q(g,h) for any h.

The term “k-tuples” shall mean a group of size k. For instance, if k=2, then we call them a “couple”. In addition, if k=3 we call them a “triplet”. This is an abbreviated term to show the relationship between group size and designated pairings.

The term “good solution” shall mean a set of predictors having a given size with high enough quality.

The term “M-dimensional space” shall mean a variety of orientations and positions in space (i.e., M = 1 . . . 1000, arbitrary and large).

When a search for the best (in quality) group of individuals is conducted, the term “nesting” or “nesting effect” shall mean procedures that are based on the assumption that a good-in-quality group consists of good-in-quality subgroups, overlooking solutions made up of mediocre or poor (in quality) individuals.

The term “clustering” or “cluster” shall mean associations made between previously non-associated groups or features. Clusters are based upon potential pathway interactions rather than just upon similar expression activity. Use of the method or technique in this way can also allow the identification of genes, proteins or similar type molecules which are involved in predictor sets for many targets, again leading to pathway identification.

The array is a significant new technology aimed at providing a top-down picture of the intimate processes of an organism. Whether an array is implemented using photolithography and silicon-based fabrication, capillary printing heads on a glass slide, or ink-jet technology, it allows for quantification of large amounts of data simultaneously. Only a few years have passed since the first micro-array based biological results were published, and it already seems unthinkable to tackle the complexities of the workings of the cell without these devices. However, there remain unsolved image-processing, computational and mathematical difficulties associated with the extraction and validation of data for micro-array assays.

Network reconstruction's main function is to discover the existence and determine the association of multiple dependencies between biological or transcriptional data, leading to the identification of possible associated pathways. There are a number of potential motivations for constructing algorithms and computer software for network reconstruction. For instance, diagnostics and the diagnostic industry can use these techniques and tools for application in disease identification. Other potentially valuable applications include treatment selection and outcome prediction. In addition, the derived information can be further applied to aid in drug development and areas of feature selection for classification.

The references cited in this application are incorporated in this application by reference. However, cited references are not admitted to be prior art to this application.

Feature Selection in Pattern Recognition

The problem of network reconstruction, based on micro-array data, can be reduced to finding dependencies within a subset of features in terms of their expression levels. One meaningful option to address this problem is to find a set of the best k predictors for any target of interest.

In the context of pattern recognition, machine learning, data mining and their applications to various subject areas, e.g., medical diagnostics, manufacturing test design, image recognition, etc., a similar problem of subset selection, known as feature selection, is faced. We have discussed this related work in feature selection, and the advantages of the proposed method over these works, in the section “Background of the Invention”. We concentrate here on the method of the invention.

The method described below is somewhat similar to what are called the “sequential forward selection” (SFS) and “sequential floating forward selection” (SFFS) methods. SFS adds the “best” new feature (i.e., the one that leads to the largest improvement in the value of the quality function), at each successive stage of the algorithm, to the current set of features until the needed number of features is reached. In particular, SFS suffers from the nesting problem and always overlooks better solution sets whose features are of mediocre or poor quality. This is one of the shortcomings addressed by this invention. While SFFS also addresses the nesting problem, it maintains exponential time complexity for large data sets. The second shortcoming that this invention addresses is the exponential time complexity of finding “good” solutions. The method finds a “good” solution set in linear (with respect to k) time with high probability. One of the floating search algorithms, called “oscillating search”, can also find an approximate solution in linear time. However, the method guarantees an equal or better quality solution while maintaining the linear time complexity. The methods and algorithm of the present invention may be used and employed in a variety of systems that receive or export large volumes of data or information. In particular, the algorithm and/or method have potential application with computers and/or computer software. The algorithm may be used, employed or supplied with other similar hardware, software, or systems that are well known in the art.

Referring now to FIG. 1, the first step in a network reconstruction entails the search for strong linear dependencies among associated data, i.e.,

G ~ F(g, h, . . . )   (1)

where G 120 is the target of interest and g 140, h 150 are predictors. The function F can be linear or non-linear. In this application, we investigate only the linear case:

F = α*g + β*h   (2)

where α and β are constants.

The quality of a linear prediction of a target G 120, associated with a set of features g 140, h 150, is given by the quality function Q 110:

$Q(G; g, h) = \min_{\alpha,\beta} \sum_{i=1}^{m} \left( G_{i} - \alpha g_{i} - \beta h_{i} \right)^{2} \Big/ \sum_{i=1}^{m} G_{i}^{2} \qquad (3)$
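For illustration only, the following Java sketch evaluates the quality function of equation (3) for a pair of predictors (k=2) by solving the 2×2 normal equations in closed form. The class and method names are hypothetical and are not part of the patent disclosure.

public final class PredictionQuality {

    /**
     * Quality Q(G; g, h) of equation (3): the least-squares residual of the
     * best linear fit G ~ alpha*g + beta*h, normalized by the squared norm
     * of G. A smaller value means a better quality of prediction.
     */
    public static double quality(double[] G, double[] g, double[] h) {
        int m = G.length;
        double gg = 0, gh = 0, hh = 0, Gg = 0, Gh = 0, GG = 0;
        for (int i = 0; i < m; i++) {
            gg += g[i] * g[i];  gh += g[i] * h[i];  hh += h[i] * h[i];
            Gg += G[i] * g[i];  Gh += G[i] * h[i];  GG += G[i] * G[i];
        }
        // Normal equations [gg gh; gh hh][alpha; beta] = [Gg; Gh],
        // solved by Cramer's rule (assumes g and h are not collinear).
        double det = gg * hh - gh * gh;
        double alpha = (Gg * hh - Gh * gh) / det;
        double beta  = (gg * Gh - gh * Gg) / det;
        double residual = 0;
        for (int i = 0; i < m; i++) {
            double r = G[i] - alpha * g[i] - beta * h[i];
            residual += r * r;
        }
        return residual / GG;
    }

    public static void main(String[] args) {
        double[] G = {1.0, 2.1, 3.0};  // target expression over m = 3 experiments
        double[] g = {1.0, 1.0, 1.0};
        double[] h = {0.0, 1.0, 2.0};
        System.out.println(quality(G, g, h));  // near 0: G is close to g + h
    }
}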

FIG. 1 shows a representation of the g 140 and h 150 vectors in 3-dimensional space. Each vector is defined by three components x, y and z, which are the expression levels of a feature over a set of three experiments. G, h and g are three arbitrary vectors in 3-dimensional space, and the figure is for illustration purposes only and should not be interpreted as limiting the scope of the invention in any way. In addition, the use of the x, y, and z coordinate system and the described planes should not in any way be interpreted as limiting the broad scope of the invention. The quality of a given pair g 140, h 150, as predictors of G 120, is the distance between the plane formed by these two vectors and the target vector G 120 (see FIG. 1 for more details). G^(P) 125 is defined as the projection of G 120 on the plane 130 formed by g 140 and h 150.

Greedy Subset Search Algorithm (GSSA)

Referring now to FIGS. 1-3, the algorithm starts with an ordered, randomly selected subset of k−1 features. Next, this subset is complemented by one feature which, along with the selected features, forms the best possible (in quality) subset of k features. Number k is assigned to this feature. Feature number one is removed from the set, and the numbering of the remaining features 2, . . . , k is reduced by one to become 1, . . . , k−1. This subset of k−1 features is the input to the next iteration of the loop 315 (see FIG. 3).

Iterations continue until the quality of a set of k predictors cannot be further improved by replacing any one feature in the set 325 (see FIG. 3).

Method Outline: The Algorithm

Let S = {g₁, . . . , g_N} be the expression levels of N features observed over the course of M experiments, and let G 120 be an arbitrary target with expression levels over the same group of experiments. In this way, all features are represented by vectors in M-dimensional space. G 120 may or may not belong to S 200 (not shown in diagrams). In what follows, we will refer to every element of S 200 as a feature.

General Definition

We say that a subset s ⊂ S 200 of features predicts the target G 120 with accuracy δ if the distance between the linear subspace generated by s and G 120 equals δ.

The Euclidean distance has been selected as a measure of proximity. Given this, and the linearity of the subspace, the definition above is simply stated as follows: the least-squares distance between the subset s and the target G 120 equals

$\delta = \min_{\alpha_{j}} \left\| G - \sum_{g_{j} \in s} \alpha_{j} g_{j} \right\| \Big/ \left\| G \right\| \qquad (4)$

where the min is taken over all sets of k real numbers α_j and ∥ ∥ represents the norm in M-dimensional Euclidean space. The algorithm, in its current implementation, does not take advantage of the specificity of the distance definition and, therefore, can be applied to any search problem of the described nature with an arbitrary proximity measure. The value of δ will be referred to as the quality Q(s) 110 of a set of predictors s.

FIG. 2 shows a block diagram of the method of the present invention and how the algorithm works with the predictor and target data for a given size k of the predictor set. A target 120 of interest is first selected (not shown in the block diagram, but shown in FIG. 1). The method then selects an ordered subset of features of size k−1. If k=2, the method randomly selects a feature, say g 140 (shown as reference number 200 in the block diagram). To this subset the method now adds the complement feature, i.e., to form the best subset of size k (in terms of quality of prediction) that contains the initially chosen subset of k−1 features. The quality of prediction of this subset of k features is then noted (shown as reference 220 in the block diagram). Next, the algorithm performs a checking step to see if the same set of k features has appeared k times in a row (shown as reference number 225). If the answer is “yes”, the algorithm stops and outputs the set of k features as the result. This is called an attractor (shown as reference numeral 230 in the diagram). If the answer is “no”, the algorithm removes, in order, one feature from the predictor set (shown as 240 in the block diagram). The algorithm continues and repeats the steps of adding a complement, checking the predictor set and removing a feature until the same set of k features has appeared k times in a row.

This process (of deleting a feature and adding another feature) may be repeated many times until a subset is reached whose quality of prediction cannot be improved by deleting and then adding a single feature. This subset of size k (k=2 here) is referred to as an “attractor” 230 (shown generally as 230 in the block diagram of FIG. 2). The method then terminates and outputs the “attractor” as the best (in terms of quality of prediction of the target expression) subset of size k (k=2 here). The process can be modified, to probabilistically select the smallest (in size) subset of predictors that is closely related to the target in terms of the quality of prediction, as follows. The method starts with an initial predictor set size of k=2 as described above. If the “attractor” set that the method outputs does not lie within the acceptable threshold of quality of prediction, the predictor set size is incremented by one, i.e., k=3. The algorithm uses the “attractor” of the previous stage (k=2) as the starting subset of size k−1 at this stage. The method is iteratively applied with increasing values of k, until an “attractor” is found that is related to the target within the a priori set threshold of quality of prediction.

Referring now to FIG. 3, the definitions of “attractor” and “complement” will now be clarified. FIG. 3 shows how a feature g 310 is first complemented by g* 320. That g* 320 is a “complement” to g 310 implies that the set consisting of features g 310 and g* 320 has the best quality of prediction of all sets of size k=2 containing feature g 310. Feature g 310 is then removed and a new “complement” feature g^((2))* 330 is added to g* 320, and so on, until the final predictor set (g^((k−1))*, g^((k))*) 325 is found, such that the “complement” of g^((k))* is the same as g^((k−1))* (comparisons are shown by reference numeral 315). At this stage no deletion or addition of a single feature can improve the quality of prediction. Such a predictor set that can no longer be improved is called an “attractor”.

Definition of Complement(s):

Referring to FIG. 3, feature g* 320 is called a complement to a set s, i.e., g* = (s)*, if the quality of the set s ∪ g* is better than or equal to the quality of s ∪ g for any choice of g 310, i.e.,

Q(s ∪ g*)≦Q(s ∪ g)   (5)

Definition of Attractor(s):

A set of features s is called an attractor (with respect to the complement) if (s′)* ∪ s′ = s for any subset s′ of s that contains all but one element of s, i.e., one cannot improve the quality of a set of predictors s by substituting only one feature (see FIG. 3).

The Problem and Algorithm:

For a given number k = 2, 3, . . . , a set of n features S, and a target G 120, the problem is to find the best quality subset of features s of size k. To solve the problem, a random algorithm has been designed. The main loop of this algorithm is defined as a cycle run for one random seed set s′. For a given k, the solution starts with a randomly selected seed set s′ of size k−1. One cycle of the main loop (let s[i] represent the i-th element in s):

  Set index = 0;
  Randomly generate s′ of size k−1;
  Set current best attractor, s = φ (empty set);
  Repeat unconditionally {
    Find the complement g* of s′, i.e., g* = (s′)*;
      /* there is always an assumed seniority order in s, i.e., later
         complements have greater index, e.g., g* = s[k] */
    If (s′ ∪ g* = s) then {
      s = s′ ∪ g*;
      index = index + 1;
      if (index = k) then break out of the loop;
    }
    else { /* found a new set of better quality than s */
      index = 1;
      s = s′ ∪ g*;
    }
    Form the new seed set s′ as s without its first element s[1];
  }
  Print set s as the predictor set;
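By way of a non-limiting illustration, the pseudocode above might be rendered in Java (the implementation language mentioned below) roughly as follows. All class and method names are hypothetical; the quality routine is one possible realization of the normalized least-squares distance of equations (3)-(4) for an arbitrary subset size, solving the normal equations by Gaussian elimination.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of one cycle of the main GSSA loop. */
public final class GreedySubsetSearch {

    private final double[][] features; // features[j] = expression vector of feature j
    private final double[] target;     // target G over the same m experiments

    public GreedySubsetSearch(double[][] features, double[] target) {
        this.features = features;
        this.target = target;
    }

    /** One cycle: random ordered seed of size k-1, then complement/removal until an attractor. */
    public List<Integer> gssaCycle(int k) {
        Deque<Integer> seed = new ArrayDeque<>();
        while (seed.size() < k - 1) {                 // random seed s' of size k-1
            int j = ThreadLocalRandom.current().nextInt(features.length);
            if (!seed.contains(j)) seed.addLast(j);
        }
        List<Integer> s = new ArrayList<>();          // current best attractor (empty set)
        int index = 0;
        while (true) {
            int gStar = complement(seed);             // g* = (s')*
            List<Integer> candidate = new ArrayList<>(seed);
            candidate.add(gStar);
            if (new HashSet<>(candidate).equals(new HashSet<>(s))) {
                if (++index == k) break;              // same set k times in a row: attractor
            } else {
                index = 1;                            // found a new set of better quality
                s = candidate;
            }
            seed.addLast(gStar);                      // seniority order: complement goes last
            seed.removeFirst();                       // drop the most senior feature s[1]
        }
        return s;
    }

    /** Feature that, added to the seed, yields the best (smallest) quality value. */
    private int complement(Deque<Integer> seed) {
        int best = -1;
        double bestQ = Double.POSITIVE_INFINITY;
        for (int j = 0; j < features.length; j++) {
            if (seed.contains(j)) continue;
            List<Integer> trial = new ArrayList<>(seed);
            trial.add(j);
            double q = quality(trial);
            if (q < bestQ) { bestQ = q; best = j; }
        }
        return best;
    }

    /** Normalized least-squares residual of the best linear fit (cf. equations (3)-(4)). */
    public double quality(List<Integer> s) {
        int k = s.size(), m = target.length;
        double[][] aug = new double[k][k + 1];        // augmented normal equations [A^T A | A^T G]
        for (int a = 0; a < k; a++) {
            for (int b = 0; b < k; b++)
                for (int i = 0; i < m; i++)
                    aug[a][b] += features[s.get(a)][i] * features[s.get(b)][i];
            for (int i = 0; i < m; i++)
                aug[a][k] += features[s.get(a)][i] * target[i];
        }
        double[] coef = solve(aug, k);
        double num = 0, den = 0;
        for (int i = 0; i < m; i++) {
            double fit = 0;
            for (int a = 0; a < k; a++) fit += coef[a] * features[s.get(a)][i];
            num += (target[i] - fit) * (target[i] - fit);
            den += target[i] * target[i];
        }
        return num / den;
    }

    /** Gaussian elimination with partial pivoting on a k x (k+1) augmented system. */
    private static double[] solve(double[][] aug, int k) {
        for (int col = 0; col < k; col++) {
            int piv = col;
            for (int r = col + 1; r < k; r++)
                if (Math.abs(aug[r][col]) > Math.abs(aug[piv][col])) piv = r;
            double[] tmp = aug[col]; aug[col] = aug[piv]; aug[piv] = tmp;
            for (int r = col + 1; r < k; r++) {
                double f = aug[r][col] / aug[col][col];
                for (int c = col; c <= k; c++) aug[r][c] -= f * aug[col][c];
            }
        }
        double[] x = new double[k];
        for (int r = k - 1; r >= 0; r--) {
            double sum = aug[r][k];
            for (int c = r + 1; c < k; c++) sum -= aug[r][c] * x[c];
            x[r] = sum / aug[r][r];
        }
        return x;
    }
}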

The number of cycles in the main loop can be controlled by (i) the quality of a current set of predictors; (ii) total computational time; and (iii) a direct constraint on the number of cycles itself. From a practical point of view, the above methods are suitable for parallel processing. Since each loop begins with a random subset of data, all that is necessary for the computation is to allow the method to generate these random subsets and compare with previously computed best subsets. Therefore, different initiations of the algorithm can be distributed across processors, as long as different random seed subsets are used in each loop and the computed best subsets are appropriately compared.
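As one illustrative way (among many) to realize this parallelism, independent cycles can be farmed out with Java's parallel streams. The gssaCycle and quality methods refer to the hypothetical GreedySubsetSearch sketch above; the attractor with the smallest (best) quality value across all random seeds is kept.

import java.util.Comparator;
import java.util.List;
import java.util.stream.IntStream;

/** Hypothetical driver: run many independent random-seed cycles in parallel. */
public final class ParallelRestarts {
    public static List<Integer> bestOf(GreedySubsetSearch gssa, int k, int cycles) {
        return IntStream.range(0, cycles)
                .parallel()                                     // each cycle draws its own random seed
                .mapToObj(i -> gssa.gssaCycle(k))
                .min(Comparator.comparingDouble(gssa::quality)) // smaller quality value = better
                .orElseThrow();
    }
}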

The algorithm above has been provided so that one of ordinary skill in the art can design and write code from it. Java software was used to code the algorithm in experimental runs. Experiments and data were run on a Personal Computer (PC) with a Windows NT operating system. The system had 512 MB RAM and an 800 MHz CPU. One of ordinary skill in the art needs to use a programming language to make a software prototype of the algorithm, and needs a computer with the software coding environment loaded. No special hardware is required to run this algorithm; the larger the RAM and the greater the processing power (parallel processors would be even better), the better. Methods for exhaustive searches are also well known in the art and need not be described here in detail.

Algorithm performance:

It can be shown that one cycle of the main loop always terminates, and that the predictor set it terminates at is an “attractor”. It is obvious that the best (in quality) set of genes must be an “attractor” as well. Therefore, all else being equal, the algorithm's performance depends on the number of “attractors”; the smaller the number of “attractors”, the more likely the loop terminates at the best “attractor”.

If π is the probability that one cycle of the main loop terminates at the best “attractor”, then the probability that after P iterations of the main loop the best “attractor” will be visited at least once is

1 − (1 − π)^P   (6)
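To illustrate equation (6) with hypothetical numbers: if a single cycle finds the best “attractor” with probability π = 0.2, then P = 20 independent cycles visit it at least once with probability 1 − (1 − 0.2)^20 = 1 − 0.8^20 ≈ 0.99.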

It has been observed, in numerous computations using both generated and real data, that the number of “attractors” goes down as the number of experiments (the dimension of the space) goes up.

In other words, as more independent experiments are conducted, the reliability of the data improves. Since the number of “attractors” is reduced with an increasing number of experiments, the probability of finding the best in quality (optimal) “attractor”, using the GSSA algorithm, increases.

The probability, π, of finding the optimal “attractor”, however, goes down with an increase in the number of genes, N, and an increase in the predictor set size, k. This is intuitive, since the number of possible “attractors” will increase with an increase in the number of genes or the size of the predictor set.

The figures described and illustrated below were plotted using Matlab for the PC, running Windows NT. Other methods well known in the art can also be used to plot the data. Matlab was used because it requires only (x, y) pairs of data for plotting. These and other plotting methods are well known in the art. However, the described applications and plots should in no way limit the broad scope of the invention. The program can be coded to run on other computers, and different software may be used to plot the results. FIGS. 4-8 show plots of data from a series of runs of the GSSA algorithm. The plots show the application of the algorithm and method to gene network reconstruction.

FIG. 4 shows the number of attractors (Nattr) plotted against the number of experiments (Nexp). The plot shows the number of attractors of size 2 (i.e., k=2) as the dimensionality of the data increases. Each gene is represented as a vector in m dimensions. These m dimensions are the m experiments that were conducted on all the genes in the set S. This simulation consists of the expression of 200 genes in 40 experiments; thus, the size of the set S is 200 and m=40. The graph shows how the number of attractors of size 2 (k=2) decreases as more experimental values are considered. It shows that as we use more experiments, we have fewer attractors and hence a better chance of reaching the best solution in a few random runs of the GSSA algorithm. The use of the algorithm provides a special “blessing of dimensionality” toward a final solution. As has been described above, the number of “attractors” can actually be controlled by the number of experiments. In addition, the number of “attractors” can be estimated a priori (the proof is beyond the scope of this description). The existence of multiple “attractors” for a given data set may be evidence of an insufficient number of experiments to draw confident conclusions regarding the nature of dependencies among genes. It also suggests a specific number of additional experiments to be performed to substantiate the conclusion.

Another point to be noted is that there is a tradeoff between the number of genes replaced at each step of the algorithm and the quality of the “attractor”. For instance, the computational time increases exponentially as the number of replaced genes goes up, but the algorithm may stop at a better quality “attractor”. However, one-gene-replacement algorithms converge to a high “quality” attractor with high probability, given a sufficient number of experiments.

FIG. 5 shows a simulated plot of the probability of finding the optimal predictor set in a single run of the approximate algorithm. As can be determined from the diagram, as the number of genes increases, the probability of finding the actual predictor set in most cases decreases. FIG. 5 shows the probability of finding the best solution (the best attractor of size k=numP in the graph) from a set of N (x-axis of the graph) genes in one run of GSSA. One run of GSSA means starting with one random seed and running the GSSA algorithm until an attractor is found. The dimensionality of the data (the number of experiments, m) was fixed at 38. Three plots are shown in this figure: the top plot shows the results for k=numP=2, the middle plot for k=numP=3, and the bottom plot for k=numP=4. The three plots behave similarly. The top plot (k=numP=2) shows that the probability of finding the best solution drops as we increase the number of genes. This is because more attractors exist as the number of genes increases. The difference among the three plots also shows that if we keep the number of genes fixed (take a fixed value on the x-axis) but increase the value of k (i.e., increase k from 2 to 4), the probability of finding the best solution also decreases (and this too is because the number of attractors increases as the size of k increases). If π is the probability of reaching the best solution in one run of GSSA, then the probability that it reaches the best solution in P trials is 1−(1−π)^P, which can be brought close to 1 if we increase P.

FIG. 6 shows a simulated plot demonstrating that the computational complexity of the approximate algorithm increases linearly, as opposed to the exponential increase of an exhaustive search. The diagram shows a plot of execution time vs. number of genes. It becomes evident that as the number of genes increases, the execution time increases in a linear fashion. The fact that the execution times (as the number of genes increases) lie on a line indicates the linear nature of the algorithm. FIG. 6 shows the execution time of running GSSA 50 times versus the total number of genes. We take 50 random start seeds and run the GSSA until it reaches an attractor for each of the 50 instances. Then we select, as the best answer, the best result of the 50. The size of the set S of genes from which to choose the predictors is increased in the experiment. We see that the execution time increases linearly. The three plots correspond to three cases (k=numP=2, 3, 4). The exhaustive search that finds the best solution has an exponential increase in time.

FIG. 7 shows a simulated plot of the execution time of the algorithm as a function of the number of predictors. The results of the plot indicate that as the predictor set size increases, the execution time also increases, but in a linear fashion. In other words, the execution time does not increase at the same exponential rate as an exhaustive search when there is an increase in the number of predictors. This is very important for the calculations necessary in gene network reconstruction. What may take years to complete (due to the exponential nature of exhaustive search) may now be completed in a matter of a few minutes. FIG. 7 keeps the number of genes fixed (S is fixed), varies k=numP, and plots the execution time for 50 runs of the GSSA algorithm. The execution times can be divided by 50 to yield the time for a single run of GSSA (take one random seed and run GSSA until you reach an attractor). Again, the execution time increases linearly with an increase in k. Two plots are shown, for sizes of S = 10 and 20. Note that the execution times for FIG. 6 and FIG. 7 are for the algorithm implemented in Java and running on a PC running Windows NT with an 800 MHz CPU and 256 MB RAM.

FIG. 8 shows a plot of the log of the execution time against the number of genes. The trend in this plot is similar to the trends presented in the previous plots. The plot clearly shows the novelty and power of the present algorithm in finding “good” solutions that may be effective in the gene network reconstruction problem. FIG. 8 also shows the difference between the execution times of the GSSA algorithm and the exhaustive search. The number of genes (size of S) is varied, and the execution times are shown as logarithms (base e, the natural logarithm). Here, we show the time for only one run of GSSA (as against 50 runs of the algorithm in FIGS. 6 and 7). This shows that this algorithm is extremely fast compared to exhaustive search methods that are well known in the art.

FURTHER APPLICATIONS OF THE INVENTION

The method of the invention can easily be extended to probabilistically determine the smallest (in size) subset of features that is closely related to a given target, within an a priori set tolerance level, in terms of a selected quality function. The method then takes as input a given target (user selected) and iteratively applies the method of the invention for subsets of size 1, 2, 3, . . . , k, etc., until a predictor set that is closely related to the target expression, within the a priori set threshold, is found.
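A minimal sketch of this outer loop, assuming the hypothetical GreedySubsetSearch class from the earlier sketch, might look as follows. Note one simplification: the method as described seeds stage k with the stage k−1 attractor, whereas this sketch simply starts each stage from a fresh random seed.

import java.util.Collections;
import java.util.List;

/** Hypothetical driver: grow k until the attractor meets the a priori tolerance. */
public final class IncrementalSearch {
    public static List<Integer> smallestGoodSet(GreedySubsetSearch gssa, double tolerance, int maxK) {
        for (int k = 2; k <= maxK; k++) {
            List<Integer> attractor = gssa.gssaCycle(k);  // simplification: fresh random seed per stage
            if (gssa.quality(attractor) <= tolerance) {
                return attractor;                         // smallest k within the set tolerance
            }
        }
        return Collections.emptyList();                   // no subset of size <= maxK met the tolerance
    }
}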

The method of the invention can also be used for classification. The method now defines a target in terms of a vector of numbers representing the classes of the experiments under consideration. The method of the invention can be used to identify a subset of features that can predict the class of the experiment within the given quality of prediction. As an example (though not limiting the use of the invention), we can consider micro-array data for, say, leukemia. The data consist of gene expression results for different tissues (a tissue sample represents one experiment) representing various types of leukemia. If we assign a number to each type of leukemia (say, 0, 1, 2, etc.), then we can define a target over all the experiments by the vector of numbers representing the type of leukemia represented by the tissues. We can now use the method of the invention to identify a “good” subset of genes that can predict the target vector (and hence the type of tissues). Hence, these genes can form a diagnostic set of genes whose expression values are used to discriminate between the various types of leukemia.

Various modifications to the embodiments of the invention described above are, of course, possible. Accordingly, the present invention is not limited to the particular embodiments described in detail above.

Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present invention. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.

CLAIMS

1. A method of network reconstruction comprising: (a) selecting a target; (b) selecting a predictor set of features; (c) adding at least one complement to said predictor set based on a quality of prediction; (d) checking to see if all of said features are repeated; and (e) removing at least one feature from said predictor set; and as a result of performing each of steps (a)-(e) at least once, reconstructing a network.
2. A method as recited in claim 1, wherein said target can be changed and subsets of targets can be formed for predicting other associated predictor sets of features.
3. A method as recited in claim 1, further comprising repeating steps (e), (c) and then (d) until determined in step (d) that all of said features of said predictor set have been repeated k times in a row, wherein k corresponds to the number of features in the predictor set.
4. A method as recited in claim 3, wherein said predictor set of features is compared to other associated predictor sets to determine associated pathways that are used in network reconstruction.
5. A method as recited in claim 3, wherein said predictor set of features is compared to other non-associated predictor sets to determine associated pathways that are used in network reconstruction.
6. A method as recited in claim 3, wherein said predictor set is used in determining clusters.
7. A method as recited in claim 3, wherein said k features of the predictor set are ordered; and wherein said feature that is removed from said predictor set is the first feature in the ordered predictor set.
8. A method as recited in claim 1, wherein said selecting step comprises: selecting k−1 number of features at random, wherein k is a number greater than 1.
9. A method as recited in claim 1, wherein said features of said predictor set are selected in a defined order.
10. A method as recited in claim 1, wherein said feature that is removed from said predictor set in step (e) is the earliest feature defined in the ordered predictor set.
11. A method as recited in claim 1, wherein the predictor set and target are vectors in M-dimensional space, wherein M is a number greater than or equal to 1.
12. A method as recited in claim 1, wherein said selected predictor set has a size of between 1-1000 features.
13. A method as recited in claim 1, wherein said steps of selecting and adding are performed by a processor-based device using a first algorithm, and wherein said checking step is performed by a separate algorithm.
14. The method of claim 1 wherein said selecting step comprises selecting k−1 features associated with said target to include in said predictor set, wherein k is a number greater than 1.
15. The method of claim 14 wherein said adding step comprises adding a complement to said k−1 features to form a predictor set of k features.
16. The method of claim 15 wherein said checking step comprises checking to see if all of said k features of said predictor set have been repeated k times in a row.
17. The method of claim 16 wherein said removing step comprises: if determined in step (d) that all of said k features of said predictor set have not been repeated k times in a row, removing a feature from said predictor set in step (e) and returning to step (c).
18. The method of claim 17 further comprising: (f) if determined in step (d) that all of said k features of said predictor set have been repeated k times in a row, then determining such predictor set as a best predictor set of k features for predicting the presence of said target.
19. The method of claim 18 further comprising: ordering the k−1 subset of features in a list.
20. The method of claim 19 wherein said adding said complement to said subset comprises: adding said complement to one end of said list.
21. The method of claim 20 wherein said removing said feature from said predictor set comprises: removing said feature from the other end of said list.
22. The method of claim 18 further comprising: determining whether the determined best predictor set of k features for predicting the presence of said target satisfies a predetermined threshold for quality of prediction.
23. The method of claim 22 wherein if determined that the best predictor set of k features for predicting the presence of said target does not satisfy said predetermined threshold, then incrementing the value of k.
24. The method of claim 23 further comprising: repeating steps (a)-(f) for the incremented value of k.
25. The method of claim 22 wherein, if determined that the best predictor set of k features for predicting the presence of said target does not satisfy said predetermined threshold, performing the following: (g) selecting the determined best predictor set of k features; and (h) adding at least one complement feature to said k features to form a new predictor set of k+1 features.
26. The method of claim 25 further comprising: (i) checking to see if all of said k+1 features of said new predictor set have been repeated k+1 times in a row; (j) if determined in step (i) that all of said k+1 features of said new predictor set have not been repeated k+1 times in a row, removing at least one feature from said new predictor set and returning to step (g); and (k) if determined in step (i) that all of said k+1 features of said new predictor set have been repeated k+1 times in a row, then determining such new predictor set as a best predictor set of k+1 features for predicting the presence of said target.
27. The method of claim 26 further comprising: determining whether the determined best predictor set of k+1 features for predicting the presence of said target satisfies said predetermined threshold for quality of prediction.
28. The method of claim 1 wherein each of said features comprises corresponding measurement data.
29. The method of claim 1 wherein said target is a multiple dependency.
30. The method of claim 1 wherein said target is an associated pathway.
31. Computer software, embodied on a computer-readable medium, and operable for performing network reconstruction, comprising an algorithm that performs the steps of: (a) selecting a target; (b) selecting a predictor set of features; (c) adding a complement to said predictor set based on a quality of prediction; (d) checking to see if all of said features are repeated; and (e) removing one feature from said predictor set; and as a result of performing each of steps (a)-(d) at least once, reconstructing a network.
32. Computer software as recited in claim 31, wherein any of said steps of said algorithm are user defined.
33. Computer software as recited in claim 31, wherein any of said steps of said algorithm are software defined.
34. A method as recited in claim 31, further comprising repeating steps (e), (c) and then (d) until determined in step (d) that all of said features of said predictor set have been repeated k times in a row, wherein k corresponds to the number of features in the predictor set.
35. The method of claim 34 wherein said removing step comprises: if determined in step (d) that all of said k features of said predictor set have not been repeated k times in a row, removing a feature from said predictor set in step (e) and returning to step (c).
36. The method of claim 35 further comprising: (f) if determined in step (d) that all of said k features of said predictor set have been repeated k times in a row, then determining such predictor set as a best predictor set of k features for predicting the presence of said target.
37. A system for performing network reconstruction, comprising: (a) a computer; and (b) computer software, embodied on a computer-readable medium, and executable by said computer for performing network reconstruction according to the steps of: (i) selecting a predictor set of features; (ii) adding a complement to said predictor set based on a quality of prediction; (iii) checking to see if all of said features are repeated; and (iv) removing one feature from said predictor set.
38. A method as recited in claim 37, further comprising repeating steps (e), (c) and then (d) until determined in step (d) that all of said features of said predictor set have been repeated k times in a row, wherein k corresponds to the number of features in the predictor set.
39. The method of claim 38 wherein said removing step comprises: if determined in step (d) that all of said k features of said predictor set have not been repeated k times in a row, removing a feature from said predictor set in step (e) and returning to step (c).
40. The method of claim 39 further comprising: (f) if determined in step (d) that all of said k features of said predictor set have been repeated k times in a row, then determining such predictor set as a best predictor set of k features for predicting the presence of said target.
41. A method of network reconstruction comprising: selecting a k−1 subset of features associated with a target; ordering the k−1 subset of features in a list; adding to one end of the list a complement feature to form a predictor set of k features; determining whether the features of the predictor set have appeared k consecutive times; if the features of the predictor set have not appeared k consecutive times, then removing a feature from the other end of the list; if the features of the predictor set have appeared k consecutive times, then determining that the predictor set is a best predictor set of k features for predicting the presence of said target; and using the determined best predictor set of k features for predicting the presence of said target.
42. The method of claim 38 further comprising: adding a second complement feature to the one end of the list.
43. The method of claim 38 further comprising: repeating said adding, determining, and removing steps until said determining step determines a predictor set of features that have appeared k consecutive times.
44. The method of claim 41 further comprising: using the determined best predictor set of k features to determine the association of multiple dependencies.
45. A method of network reconstruction comprising: (a) selecting a subset of (k−x) features associated with a target, k being a number greater than 1 and x being a number greater than 0 and less than k; (b) adding x complement features to said subset to form a predictor set of k features; (c) checking to see if all of said k features of said predictor set have been repeated k times in a row; (d) if determined in step (c) that all of said features of said predictor set have not been repeated k times in a row, removing at least one feature from said predictor set and returning to step (b); and (e) if determined in step (c) that all of said features of said predictor set have been repeated k times in a row, then determining such predictor set as a best predictor set of k features for predicting said target.
46. The method of claim 45 further comprising: using the determined best predictor set of k features to determine the association of multiple dependencies.
47. The method of claim 45 wherein if determined that using the best predictor set of k features to determine the association of multiple dependencies does not satisfy a predetermined quality of prediction threshold, then incrementing the value of k.
48. The method of claim 45 wherein said target is a multiple dependency.
49. The method of claim 45 wherein said target is an associated pathway.